1.1   Getting to know your data

Install and load in the libraries and data we need for this section:

# Set your working directory by clicking on the top menu:
# Session > Set Working Directory > To Source File Location

# Install packages
install.packages("dplyr")

# Load in libraries
library(dplyr)

# If you want to read the documentation for a function, type 1 question mark in front of the function name:
?read.csv

# If you want to search the help pages of all installed packages (for example, to find out which package a function belongs to), type 2 question marks in front of the function name:
??read.csv

# Load in data
raw_data <- read.csv("data/raw_data.csv") 

R provides several tools for getting to know the data.

Try out the following commands and use their output to answer the question above each one:

# How many entries does the data frame have?
View(raw_data)

# What are the names of the first 3 columns?
names(raw_data)

# What are the dimensions of our data?
dim(raw_data)

# Which species have more cases? What is the mean age of the organisms infected?
summary(raw_data)

# In which region did the first case occur?
head(raw_data)

# In which region did the last case occur?
tail(raw_data)

# Which variables are numbers (num)?
str(raw_data)

# What types of species do we have in the data?
unique(raw_data$species)

# What is the first argument for the function names?
?names
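
As a minimal sketch of how these commands behave, the block below builds a small made-up data frame and applies a few base R functions to it. The object toy_data and all of its values are hypothetical and are not the workshop data. nrow() and table() are not used above, but they may answer the "how many" questions more directly than scrolling through View().

```r
# Hypothetical toy data standing in for raw_data (all values are made up)
toy_data <- data.frame(
  species = c("dog", "dog", "human", "cat"),
  age     = c(3, 5, 40, 2),
  region  = c("Mara", "Pwani", "Mara", "Pwani")
)

n_entries      <- nrow(toy_data)           # number of rows (entries)
first_cols     <- names(toy_data)[1:3]     # names of the first 3 columns
species_counts <- table(toy_data$species)  # number of records per species
mean_age       <- mean(toy_data$age)       # mean of a numeric column

print(n_entries)
print(first_cols)
print(species_counts)
print(mean_age)
```

The same calls should work on raw_data, provided it has columns with these names.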

1.2   Data subsetting and summarising

During this workshop, we will use functions available in the dplyr package to subset and summarise data.


dplyr

dplyr functions can be chained together with the %>% (pipe) operator, which passes the output of one function directly into the next as its first argument. This makes it possible to ‘stack’ multiple functions without saving an intermediate object at each step. You’ll see this in use in the following examples.
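
To illustrate the idea, here is a small sketch comparing a nested call with the equivalent piped call. The toy_data object is hypothetical and its values are made up; both versions should give the same answer.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_data <- data.frame(age = c(3, 5, 40, 2))

# Nested call: read from the inside out
nested <- nrow(filter(toy_data, age > 2))

# Piped call: read from top to bottom
piped <- toy_data %>%
  filter(age > 2) %>%
  nrow()

# Both approaches count the same rows
identical(nested, piped)
```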


1.2.1 Subsetting

Subsetting is commonly used in R to select data that you would like to use. The select function can be used to select columns.

Compare the results of using the function normally vs. with the pipe operator:

select(raw_data, age, region)

raw_data %>% 
  select(age, region)

The filter function can be used to select rows based on their values.

To select the observations you want, it is useful to know some comparison operators.
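
The following sketch shows some common R comparison operators inside filter(). It uses a hypothetical toy_data frame with made-up values, so the object names and results are for illustration only.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_data <- data.frame(
  species = c("dog", "dog", "human", "cat"),
  age     = c(3, 5, 40, 2)
)

equal_dog  <- toy_data %>% filter(species == "dog")             # equal to
not_dog    <- toy_data %>% filter(species != "dog")             # not equal to
older      <- toy_data %>% filter(age > 3)                      # greater than
at_most    <- toy_data %>% filter(age <= 3)                     # less than or equal to
pets       <- toy_data %>% filter(species %in% c("dog", "cat")) # matches any value in a set
young_dogs <- toy_data %>% filter(species == "dog" & age < 5)   # both conditions must hold (AND)
```

Note that == (two equals signs) tests for equality, whereas a single = assigns a value or names an argument.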

What does each of these lines of code filter the data for?

raw_data %>% 
  filter(region %in% c("Mara", "Pwani", "Dar-es-salaam"))

raw_data %>% 
  filter(age >= 30)

These outputs can be saved as an object, exactly as you normally would.

Store your output table as an object:

# Save output as an object
subsetted_data <- raw_data %>% 
  filter(region %in% c("Mara", "Pwani", "Dar-es-salaam"))

# Print table
subsetted_data


1.2.2 Summarising

Sometimes, you may want to work with summaries of your data. The summarise function can be used to calculate summaries of variables in your data.

What does each of the following lines of code summarise?

raw_data %>% 
  summarise(n_males = length(which(sex=="M")))

raw_data %>% 
  summarise(total_age = sum(age))

As mentioned earlier, dplyr functions can be stacked using the %>% (pipe) operator. For example, the summarise function can be combined with group_by to summarise variables by one or more columns.

How are these two tables different?

raw_data %>% 
  group_by(sex) %>% 
  summarise(n_records = length(sex))

raw_data %>% 
  group_by(region, sex) %>% 
  summarise(total_age = sum(age))
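
As a further sketch, the block below groups a hypothetical toy_data frame (made-up values) by one column and calculates two summaries per group. It uses dplyr's n() helper, which counts the rows in each group, as an alternative to length().

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_data <- data.frame(
  region = c("Mara", "Mara", "Pwani", "Pwani", "Pwani"),
  sex    = c("M", "F", "M", "M", "F"),
  age    = c(10, 20, 30, 40, 50)
)

by_region <- toy_data %>%
  group_by(region) %>%
  summarise(n_records = n(),    # number of rows in each group
            oldest = max(age))  # largest age in each group

by_region
```

When grouping by more than one column, summarise() returns a table that is still grouped by all but the last grouping column; adding .groups = "drop" inside summarise() returns an ungrouped table instead.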


1.2.3 Summary

Fill in the blanks for the following lines in your R script:

# Subset for only records with a dog 
raw_data %>%
  ___(species=="dog")

# Subset for humans, and summarise the mean age per region
raw_data %>%
  ___(species=="human") %>%
  group_by(___) %>%
  ___(mean_age = ___(age))

1.3   Build some exploratory plots


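# Load plotting, date-handling and mapping libraries
# (install them first with install.packages() if needed)
library(ggplot2)
library(lubridate)
library(leaflet)
# Note: col_palette and region_shp are used below but are not created in this section; they are assumed to exist already in your R session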

Basic barplot in ggplot2

ggplot() + 
  geom_bar(data=raw_data, aes(x=sex), fill=col_palette[1]) + 
  theme_classic()

More complicated barplot (time series) in ggplot2

# Set start and end dates for time series
raw_data$date <- as.Date(raw_data$date)
ts_start <- as.Date(paste0(substr(min(raw_data$date),1,7), "-01"))
ts_end <- ceiling_date(max(raw_data$date),'month')
ts_breaks <- seq(ts_start, ts_end, by="month")

# Subset data
male_data <- raw_data %>% filter(sex=="M")
female_data <- raw_data %>% filter(sex=="F")

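# Use histogram function to count the records in each month
# (right=FALSE puts a record dated the 1st of a month into that month's bin, not the previous one)
ts_male <- hist(male_data$date, plot=FALSE, breaks=ts_breaks, right=FALSE)
ts_female <- hist(female_data$date, plot=FALSE, breaks=ts_breaks, right=FALSE)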

# Create a data frame containing the time series data
ts_data <- data.frame(date = rep(ts_breaks[-length(ts_breaks)], 2),  # drop the final break: one date per monthly bin
                      sex = c(rep("Male", length(ts_male$counts)), 
                              rep("Female", length(ts_female$counts))),
                      n = c(ts_male$counts, ts_female$counts))

# Plot
ggplot() + 
  geom_col(data=ts_data, aes(x=date, y=n, fill=sex)) + 
  labs(x="Date", y="Number") + 
  scale_fill_manual(name="Sex", values=col_palette[1:2]) + 
  theme_classic()
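
As an alternative sketch (not the approach used above), monthly counts can also be calculated with dplyr and lubridate directly. The toy_data object and its values are hypothetical; floor_date() rounds each date down to the first day of its month and count() tallies the rows in each month and sex combination.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data (values are made up)
toy_data <- data.frame(
  date = as.Date(c("2014-01-01", "2014-01-15", "2014-02-01",
                   "2014-02-20", "2014-02-28")),
  sex  = c("M", "F", "M", "M", "F")
)

monthly_counts <- toy_data %>%
  mutate(month = floor_date(date, "month")) %>%  # first day of each record's month
  count(month, sex, name = "n")                  # number of rows per month and sex

monthly_counts
```

One difference to keep in mind: unlike the hist() approach, this produces no row for a month and sex combination with zero records, so gaps in the time series would need to be filled in separately.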

Map in Leaflet

# Create a new column specifying domestic vs. wildlife vs. human
# (create the column first so that it has one entry per row)
raw_data$species_type <- NA_character_
raw_data$species_type[raw_data$species %in% c("dog", "cat")] <- "Domestic"
raw_data$species_type[raw_data$species %in% c("jackal", "lion")] <- "Wildlife"
raw_data$species_type[raw_data$species == "human"] <- "Human"
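
The same kind of recoding can also be sketched with dplyr's case_when() inside mutate(), which keeps all the rules in one place. The toy_data object below is hypothetical; any species not matched by a rule would receive NA.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_data <- data.frame(species = c("dog", "lion", "human", "cat", "jackal"))

toy_data <- toy_data %>%
  mutate(species_type = case_when(
    species %in% c("dog", "cat")     ~ "Domestic",
    species %in% c("jackal", "lion") ~ "Wildlife",
    species == "human"               ~ "Human"
  ))

toy_data
```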

# Use only one year
leaflet_data <- raw_data %>% 
  mutate(year = substr(date, 1,4)) %>%  # year is stored as text, e.g. "2014"
  filter(year == "2014")

# Setup point colours using the colorFactor() function
leaflet_pal <- colorFactor(palette=col_palette[1:3], domain = unique(leaflet_data$species_type))

# Plot
leaflet() %>%
  addPolygons(data=region_shp, weight=1, color="black", fillColor = "white", fillOpacity=1) %>%
  addCircleMarkers(data=leaflet_data, lng=~x, lat=~y, color=~leaflet_pal(species_type),
                   radius=3, opacity = 1, fillOpacity=1, label=~species)

1.4   Introducing rShiny